
Toxic 'forever chemicals' linked to cancer now associated with major pregnancy complication

Daily Mail - Science & tech



OpenAI Is Asking Contractors to Upload Work From Past Jobs to Evaluate the Performance of AI Agents

WIRED

To prepare AI agents for office work, the company is asking contractors to upload projects from past jobs, leaving it to them to strip out confidential and personally identifiable information. OpenAI is asking third-party contractors to upload real assignments and tasks from their current or previous workplaces so that it can use the data to evaluate the performance of its next-generation AI models, according to records from OpenAI and the training data company Handshake AI obtained by WIRED. The project appears to be part of OpenAI's efforts to establish a human baseline for different tasks that can then be compared with AI models. In September, the company launched a new evaluation process to measure the performance of its AI models against human professionals across a variety of industries. OpenAI says this is a key indicator of its progress towards achieving AGI, or an AI system that outperforms humans at most economically valuable tasks. "We've hired folks across occupations to help collect real-world tasks modeled off those you've done in your full-time jobs, so we can measure how well AI models perform on those tasks," reads one confidential document from OpenAI.
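
One way to read the evaluation setup described above is as a win-rate comparison between model outputs and the contractor-supplied human deliverables. The sketch below is only an illustration of that idea, assuming the Task fields and the generate/prefers_model callables as hypothetical stand-ins; it is not OpenAI's actual evaluation code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str              # the real-world assignment, with confidential details stripped
    human_deliverable: str   # the contractor's original work product (the human baseline)

def model_win_rate(tasks: list[Task],
                   generate: Callable[[str], str],
                   prefers_model: Callable[[str, str], bool]) -> float:
    """Fraction of tasks on which a grader prefers the model's output to the human baseline."""
    wins = sum(prefers_model(generate(t.prompt), t.human_deliverable) for t in tasks)
    return wins / len(tasks)
```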


Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models

Piedrahita, David Guzman, Strauss, Irene, Schölkopf, Bernhard, Mihalcea, Rada, Jin, Zhijing

arXiv.org Artificial Intelligence

As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases continue to persist. While prior work has primarily examined socio-demographic and left–right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy–authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role-models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicit political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes. Our code is available at: https://github.com/irenestrauss/Democratic-Authoritarian-Bias-LLMs.
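
The abstract describes FavScore only at a high level; a minimal sketch of how such a favorability measurement could be assembled is shown below. The prompt templates and the rate_favorability grader are hypothetical stand-ins, not the authors' released code (see their repository for that).

```python
from statistics import mean
from typing import Callable, Sequence

# Illustrative prompts in English and Mandarin, matching the paper's cross-lingual comparison.
PROMPTS = {
    "en": "What is your overall assessment of {name} as a leader?",
    "zh": "你如何评价{name}作为领导人？",
}

def favscore(generate: Callable[[str], str],
             rate_favorability: Callable[[str], float],   # maps an answer to a score in [0, 1]
             leaders: Sequence[str],
             language: str = "en") -> float:
    """Average favorability of a model's answers about a set of world leaders."""
    scores = [rate_favorability(generate(PROMPTS[language].format(name=name)))
              for name in leaders]
    return mean(scores)  # higher = more favorable on average
```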


Evaluating Long-Context Reasoning in LLM-Based WebAgents

Chung, Andy, Zhang, Yichi, Lin, Kaixiang, Rawal, Aditya, Gao, Qiaozi, Chai, Joyce

arXiv.org Artificial Intelligence

As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40–50% in baseline conditions to less than 10% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.
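
A minimal sketch of the context-padding idea described in the abstract, assuming subtasks and distractor trajectories are plain strings and count_tokens is a caller-supplied tokenizer; the function and parameter names are illustrative, not the benchmark's actual code.

```python
import random
from typing import Callable, Sequence

def build_long_context(subtasks: Sequence[str],
                       distractors: Sequence[str],
                       target_tokens: int = 150_000,
                       count_tokens: Callable[[str], int] = len) -> list[str]:
    """Interleave irrelevant trajectories between dependent subtasks until the
    combined history is roughly target_tokens long."""
    gaps = max(len(subtasks) - 1, 1)
    padding_budget = max(target_tokens - sum(count_tokens(s) for s in subtasks), 0)
    per_gap = padding_budget // gaps   # spread the padding evenly across the gaps

    history: list[str] = []
    for i, subtask in enumerate(subtasks):
        history.append(subtask)
        if i < len(subtasks) - 1 and distractors:
            filled = 0
            while filled < per_gap:
                distractor = random.choice(distractors)
                history.append(distractor)
                filled += count_tokens(distractor)
    return history
```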


Trapped by the swipe? Dating apps are designed to keep singles 'swiping and spending' rather than finding 'The One', experts warn

Daily Mail - Science & tech





Unlocking the Potential of Global Human Expertise

Neural Information Processing Systems

For example, in the Pandemic Response Challenge experiment, the context consisted of data about the geographic region for which the predictions were made, e.g., historical data of COVID-19 cases and intervention policies; actions were future schedules of intervention policies for the region; and outcomes were predicted future cases of COVID-19 along with the stringency of those intervention policies.
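
Read as data structures, the context/action/outcome framing in this example might look roughly like the sketch below; the field names are illustrative assumptions, not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Context:
    region: str
    historical_cases: list[float]   # past COVID-19 case counts for the region
    past_policies: list[dict]       # historical intervention policies

@dataclass
class Action:
    policy_schedule: list[dict]     # future schedule of intervention policies

@dataclass
class Outcome:
    predicted_cases: list[float]    # forecast COVID-19 cases under the schedule
    stringency: float               # overall strictness of the prescribed policies
```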


Language Specific Knowledge: Do Models Know Better in X than in English?

Agarwal, Ishika, Bozdag, Nimet Beyza, Hakkani-Tür, Dilek

arXiv.org Artificial Intelligence

Often, multilingual language models are trained with the objective to map semantically similar content (in different languages) in the same latent space. In this paper, we show a nuance in this training objective, and find that by changing the language of the input query, we can improve the question-answering ability of language models. Our contributions are two-fold. First, we introduce the term Language Specific Knowledge (LSK) to denote queries that are best answered in an "expert language" for a given LLM, thereby enhancing its question-answering ability. We introduce the problem of language selection: for some queries, language models can perform better when queried in languages other than English, sometimes even better in low-resource languages, and the goal is to select the optimal language for the query. Second, we introduce baselines ranging from simple to strong to test this problem. Additionally, as a first-pass solution to this novel problem, we design LSKExtractor to benchmark the language-specific knowledge present in a language model and then exploit it during inference. To test our framework, we employ three datasets that contain knowledge about both cultural and social behavioral norms. Overall, LSKExtractor achieves up to 10% relative improvement across datasets, and is competitive against strong baselines, while being feasible in real-world settings. Broadly, our research contributes to the open-source development (https://github.com/agarwalishika/LSKExtractor/tree/main) of language models that are inclusive and more aligned with the cultural and linguistic contexts in which they are deployed.
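
A rough sketch of the language-selection step described above, with translate, ask, and grade left as caller-supplied callables; this is an illustration of the idea only, not the released LSKExtractor implementation (see the linked repository).

```python
from typing import Callable, Sequence

def best_language(question: str,
                  translate: Callable[[str, str], str],   # (text, target_language) -> text
                  ask: Callable[[str], str],              # prompt -> model answer
                  grade: Callable[[str], float],          # answer -> correctness score
                  candidate_languages: Sequence[str] = ("en", "hi", "sw", "zh")) -> str:
    """Return the candidate language in which the model answers this question best."""
    scores = {lang: grade(ask(translate(question, lang)))
              for lang in candidate_languages}
    return max(scores, key=scores.get)
```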